Math Spotting in Technical Documents Using Handwritten Queries
Authors
Abstract
Content-based image retrieval (CBIR) for documents has been studied for a long time [3]. It focuses on indexing and retrieval in large collections of document images and gives people convenient access to the information in those images: documents are retrieved by their visual features (color, shape, texture, etc.), so converting them to electronic formats can be avoided. As a sub-problem, word spotting means locating a specific word or set of words in a large database of document images. Our research concentrates on spotting mathematical symbols instead of words, so it is called math spotting. A new method, based on several current technologies, will be proposed and applied to the spotting problem. Over the years, different features have been extracted from images and various similarity measures have been tried. A hierarchical representation of the page layout, called the X-Y tree, has commonly been used to retrieve a query image by comparing tree structures [1]. The X-Y tree is obtained by X-Y cutting (Figure 1b), which uses pixel projections to cut the page in the vertical and horizontal directions alternately. Figure 1 shows the process of building an X-Y tree for a document segment. As shown in Figure 1b, X-Y cutting divides the page into many component boxes (drawn in red), each of which corresponds to one node in the X-Y tree (Figure 1c). The minimal component boxes, those that do not contain any other component box, are the leaves of the tree. For example, Figure 2a shows a simple query that was first drawn by pen on blank paper and then scanned; its X-Y cutting and X-Y tree are shown in Figures 2b and 2c, respectively. The minimal component boxes that contain “x”, “+”, “y”, “-”, and “2” correspond exactly to the five leaves of the X-Y tree.
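The X-Y cutting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the implementation behind Figures 1 and 2: it assumes a binarized page given as a list of rows of 0/1 values (1 = ink), treats any fully blank row or column as a cut position, and uses hypothetical names (`XYNode`, `xy_cut`) that do not come from the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


@dataclass
class XYNode:
    box: Box
    children: List["XYNode"] = field(default_factory=list)


def _runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges where the projection profile is non-zero."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def _profile(img, box: Box, horizontal: bool) -> List[int]:
    """Ink count per row (horizontal cut) or per column (vertical cut) inside box."""
    t, l, b, r = box
    if horizontal:
        return [sum(img[y][l:r]) for y in range(t, b)]
    return [sum(img[y][x] for y in range(t, b)) for x in range(l, r)]


def xy_cut(img, box=None, horizontal=True, tried_other=False) -> XYNode:
    """Build an X-Y tree by alternating horizontal and vertical projection cuts."""
    if box is None:
        box = (0, 0, len(img), len(img[0]))
    t, l, b, r = box
    runs = _runs(_profile(img, box, horizontal))
    if len(runs) == 1:  # no gap to cut at: just trim the blank margins
        s, e = runs[0]
        box = (t + s, l, t + e, r) if horizontal else (t, l + s, b, l + e)
    if len(runs) <= 1:
        if tried_other:  # neither direction can be cut: this box is a leaf
            return XYNode(box)
        return xy_cut(img, box, not horizontal, True)
    node = XYNode(box)
    for s, e in runs:  # one child per ink run, cut next in the other direction
        child = (t + s, l, t + e, r) if horizontal else (t, l + s, b, l + e)
        node.children.append(xy_cut(img, child, not horizontal))
    return node


# Toy page: a line holding three separate blobs, then a line holding one blob.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
]
root = xy_cut(page)
print(len(root.children), [c.box for c in root.children[0].children])
```

On real scans a minimum gap width and some noise filtering would normally be needed before cutting; both are left out here to keep the sketch short.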
Because different symbols in a document may have different X-Y tree representations, [2] suggests a possible way to retrieve math symbols from a database of document images, although it has not yet been tried on the spotting problem. Once the X-Y trees have been built, the problem of spotting math notation reduces to the problem of subtree matching. Figure 2c shows the X-Y tree for the query image; if we can find a subtree in Figure 1c that is the same as or similar to the query X-Y tree, this provides evidence that the query is contained in the page. Many subtree isomorphism algorithms have been studied recently. Considering the running-time requirement and the fact that the symbols are located in the leaves of the X-Y tree, a bottom-up algorithm [5] that runs in time linear in the size of the trees is preferred. Three steps are included:
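The three steps are not reproduced in this abstract. As a rough illustration of what bottom-up subtree matching can look like (not necessarily the algorithm of [5]), the sketch below assigns every node an integer id in a post-order pass so that identical subtrees receive the same id, then reports the page subtrees whose id equals that of the query root. It assumes that leaves have already been labeled with a symbol class (here plain strings), that children are ordered, and that only exact matches are wanted; the names `canonical_ids` and `find_subtree` are hypothetical.

```python
from typing import Dict, List, Tuple, Union

# A tree is either a leaf label (str) or an ordered list of child trees.
Tree = Union[str, List["Tree"]]


def canonical_ids(tree: Tree, table: Dict[tuple, int],
                  out: List[Tuple[int, Tree]]) -> int:
    """Post-order pass: equal ordered subtrees receive the same integer id."""
    if isinstance(tree, str):
        key = ("leaf", tree)
    else:
        # Children are labeled first, so a node's key depends only on its child ids.
        key = ("node", tuple(canonical_ids(c, table, out) for c in tree))
    ident = table.setdefault(key, len(table))  # reuse the id if this shape was seen
    out.append((ident, tree))
    return ident


def find_subtree(page: Tree, query: Tree) -> List[Tree]:
    """Return every subtree of page that is identical to query."""
    table: Dict[tuple, int] = {}
    query_id = canonical_ids(query, table, [])
    page_nodes: List[Tuple[int, Tree]] = []
    canonical_ids(page, table, page_nodes)
    return [sub for ident, sub in page_nodes if ident == query_id]


# Toy example: a page with three lines, one of which equals the query.
page = [["a", "=", "b"], ["x", "+", "y", "-", "2"], ["x", "+", "y"]]
query = ["x", "+", "y", "-", "2"]
print(find_subtree(page, query))
```

Because both trees share one id table, each node is visited once and compared by hashing its tuple of child ids, which keeps the expected running time close to linear in the total tree size. Matching "similar" rather than identical subtrees, as the text also allows, would need a tolerant leaf comparison and is outside this sketch.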
Similar resources
Connected Component Based Word Spotting on Persian Handwritten image documents
Word spotting makes unindexed image documents searchable by locating a word or words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...
Adaptation des caractéristiques pseudo-Haar pour le word spotting dans les documents manuscrits
This paper addresses the problem of word spotting in handwritten documents. We propose a coarse-to-fine, segmentation-free approach. This approach is based on two filtering phases: a global filtering followed by a local filtering after changing the observation scale. The contribution of this work is the use and adaptation of Haar-like features in the word spotting task for each test...
Radial Line Fourier Descriptor for Segmentation-free Handwritten Word Spotting
Automatic recognition of historical handwritten manuscripts is a daunting task due to paper degradation over time. Recognition-free retrieval or word spotting is popularly used for information retrieval and digitization of the historical handwritten documents. However, the performance of word spotting algorithms depends heavily on feature detection and representation methods. Although there exi...
Segmentation-Based And Segmentation-Free Methods for Spotting Handwritten Arabic Words
Given a set of handwritten documents, a common goal is to search for a relevant subset. Attempting to find a query word or image in such a set of documents is called word spotting. Spotting handwritten words in documents written in the Latin alphabet, and more recently in Arabic, has received considerable attention. One issue is generating candidate word regions on a page. Attempting to definit...
Segmentation-free Word Spotting for Handwritten Arabic Documents
In this paper we present an unsupervised, segmentation-free method for spotting and searching queries, especially in document images of handwritten Arabic. For this, Histograms of Oriented Gradients (HOGs) are used as the feature vectors to represent the query and the document images. Then, we compress the descriptors with the product quantization method. Finally, a better representati...